Pulaski County
A Multivariate Bernoulli-Based Sampling Method for Multi-Label Data with Application to Meta-Research
Chung, Simon, Vorland, Colby J., Maney, Donna L., Brown, Andrew W.
Datasets may contain observations with multiple labels. If the labels are not mutually exclusive, and if the labels vary greatly in frequency, obtaining a sample that includes sufficient observations with scarcer labels to make inferences about those labels, and which deviates from the population frequencies in a known manner, creates challenges. In this paper, we consider a multivariate Bernoulli distribution as our underlying distribution of a multi-label problem. We present a novel sampling algorithm that takes label dependencies into account. It uses observed label frequencies to estimate multivariate Bernoulli distribution parameters and calculate weights for each label combination. This approach ensures the weighted sampling acquires target distribution characteristics while accounting for label dependencies. We applied this approach to a sample of research articles from Web of Science labeled with 64 biomedical topic categories. We aimed to preserve category frequency order, reduce frequency differences between most and least common categories, and account for category dependencies. This approach produced a more balanced sub-sample, enhancing the representation of minority categories.
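The weighting idea in this abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: it estimates the probability of each observed label combination empirically (a saturated multivariate Bernoulli estimate) and, as an assumption made here, uses a tempered target `q ∝ p**alpha` to flatten the distribution while preserving frequency order. The function names and the `alpha` parameter are hypothetical.

```python
from collections import Counter

import numpy as np


def combination_weights(labels, alpha=0.5):
    """Per-row sampling weights derived from label-combination frequencies.

    labels: (n, k) binary array, one row per observation.
    alpha:  hypothetical tempering exponent in (0, 1]; alpha=1 keeps the
            population frequencies, smaller values flatten them.
    """
    rows = [tuple(int(v) for v in row) for row in labels]
    counts = Counter(rows)
    n = len(rows)
    # Empirical probability of each observed label combination.
    pmf = {combo: c / n for combo, c in counts.items()}
    # Assumed target q ∝ p**alpha, so each row gets weight q/p ∝ p**(alpha - 1).
    raw = np.array([pmf[r] ** (alpha - 1.0) for r in rows])
    return raw / raw.sum()


def weighted_subsample(labels, size, alpha=0.5, seed=0):
    """Draw `size` distinct row indices using the combination weights."""
    rng = np.random.default_rng(seed)
    weights = combination_weights(labels, alpha)
    return rng.choice(len(labels), size=size, replace=False, p=weights)
```

Because the weights are a known function of the observed combination frequencies, the deviation of the sub-sample from the population is known as well, which is the property the abstract emphasizes.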
Embedding Reliability Verification Constraints into Generation Expansion Planning
Liu, Peng, Cheng, Lian, Omell, Benjamin P., Burgard, Anthony P.
Generation planning approaches face challenges in managing the incompatible mathematical structures between stochastic production simulations for reliability assessment and optimization models for generation planning, which hinders the integration of reliability constraints. This study proposes an approach to embedding reliability verification constraints into generation expansion planning by leveraging a weighted oblique decision tree (WODT) technique. For each planning year, a generation mix dataset, labeled with reliability assessment simulations, is generated. A WODT model is trained using this dataset. Reliability-feasible regions are extracted via a depth-first search technique and formulated as disjunctive constraints. These constraints are then transformed into mixed-integer linear form using a convex hull modeling technique and embedded into a unit commitment-integrated generation expansion planning model. The proposed approach is validated through a long-term generation planning case study for the Electric Reliability Council of Texas (ERCOT) region, demonstrating its effectiveness in achieving reliable and optimal planning solutions.
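The region-extraction step can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the paper's implementation: the `Node` class, the `"<="`/`">"` branch convention, and the leaf label `1` for "reliability-feasible" are all hypothetical. Each feasible leaf yields one conjunction of linear inequalities, and the list of such conjunctions forms the disjunction; the convex hull reformulation into mixed-integer linear constraints is not shown.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# One linear split: (weights, sense, threshold), read as "w . x <sense> threshold".
Constraint = Tuple[Tuple[float, ...], str, float]


@dataclass
class Node:
    weights: Optional[Tuple[float, ...]] = None  # oblique split direction
    threshold: float = 0.0
    left: Optional["Node"] = None    # taken when w . x <= threshold
    right: Optional["Node"] = None   # taken when w . x > threshold
    label: Optional[int] = None      # leaves only; 1 = reliability-feasible


def feasible_regions(root: Node) -> List[List[Constraint]]:
    """Depth-first search collecting the split path of every feasible leaf."""
    regions: List[List[Constraint]] = []

    def dfs(node: Node, path: List[Constraint]) -> None:
        if node.label is not None:           # reached a leaf
            if node.label == 1:
                regions.append(list(path))   # copy the current path
            return
        path.append((node.weights, "<=", node.threshold))
        dfs(node.left, path)
        path.pop()
        path.append((node.weights, ">", node.threshold))
        dfs(node.right, path)
        path.pop()

    dfs(root, [])
    return regions


def satisfies(x, region: List[Constraint]) -> bool:
    """Check whether point x lies inside one extracted region."""
    for w, sense, b in region:
        s = sum(wi * xi for wi, xi in zip(w, x))
        if (sense == "<=" and s > b) or (sense == ">" and s <= b):
            return False
    return True


# Toy two-feature tree (values are illustrative only).
tree = Node(weights=(1.0, 1.0), threshold=10.0,
            left=Node(label=0),
            right=Node(weights=(1.0, -1.0), threshold=0.0,
                       left=Node(label=1), right=Node(label=0)))
regions = feasible_regions(tree)
```

A generation mix is classified as reliable if it satisfies at least one region in `regions`, which is why the constraints are disjunctive.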
Building Machine Learning Challenges for Anomaly Detection in Science
Campolongo, Elizabeth G., Chou, Yuan-Tang, Govorkova, Ekaterina, Bhimji, Wahid, Chao, Wei-Lun, Harris, Chris, Hsu, Shih-Chieh, Lapp, Hilmar, Neubauer, Mark S., Namayanja, Josephine, Subramanian, Aneesh, Harris, Philip, Anand, Advaith, Carlyn, David E., Ghosh, Subhankar, Lawrence, Christopher, Moreno, Eric, Raikman, Ryan, Wu, Jiaman, Zhang, Ziheng, Adhi, Bayu, Gharehtoragh, Mohammad Ahmadi, Monsalve, Saúl Alonso, Babicz, Marta, Baig, Furqan, Banerji, Namrata, Bardon, William, Barna, Tyler, Berger-Wolf, Tanya, Dieng, Adji Bousso, Brachman, Micah, Buat, Quentin, Hui, David C. Y., Cao, Phuong, Cerino, Franco, Chang, Yi-Chun, Chaulagain, Shivaji, Chen, An-Kai, Chen, Deming, Chen, Eric, Chou, Chia-Jui, Ciou, Zih-Chen, Cochran-Branson, Miles, Choi, Artur Cordeiro Oudot, Coughlin, Michael, Cremonesi, Matteo, Dadarlat, Maria, Darch, Peter, Desai, Malina, Diaz, Daniel, Dillmann, Steven, Duarte, Javier, Duporge, Isla, Ekka, Urbas, Heravi, Saba Entezari, Fang, Hao, Flynn, Rian, Fox, Geoffrey, Freed, Emily, Gao, Hang, Gao, Jing, Gonski, Julia, Graham, Matthew, Hashemi, Abolfazl, Hauck, Scott, Hazelden, James, Peterson, Joshua Henry, Hoang, Duc, Hu, Wei, Huennefeld, Mirco, Hyde, David, Janeja, Vandana, Jaroenchai, Nattapon, Jia, Haoyi, Kang, Yunfan, Kholiavchenko, Maksim, Khoda, Elham E., Kim, Sangin, Kumar, Aditya, Lai, Bo-Cheng, Le, Trung, Lee, Chi-Wei, Lee, JangHyeon, Lee, Shaocheng, van der Lee, Suzan, Lewis, Charles, Li, Haitong, Li, Haoyang, Liao, Henry, Liu, Mia, Liu, Xiaolin, Liu, Xiulong, Loncar, Vladimir, Lyu, Fangzheng, Makarov, Ilya, Mao, Abhishikth Mallampalli Chen-Yu, Michels, Alexander, Migala, Alexander, Mokhtar, Farouk, Morlighem, Mathieu, Namgung, Min, Novak, Andrzej, Novick, Andrew, Orsborn, Amy, Padmanabhan, Anand, Pan, Jia-Cheng, Pandya, Sneh, Pei, Zhiyuan, Peixoto, Ana, Percivall, George, Leung, Alex Po, Purushotham, Sanjay, Que, Zhiqiang, Quinnan, Melissa, Ranjan, Arghya, Rankin, Dylan, Reissel, Christina, Riedel, Benedikt, Rubenstein, Dan, Sasli, Argyro, 
Shlizerman, Eli, Singh, Arushi, Singh, Kim, Sokol, Eric R., Sorensen, Arturo, Su, Yu, Taheri, Mitra, Thakkar, Vaibhav, Thomas, Ann Mariam, Toberer, Eric, Tsai, Chenghan, Vandewalle, Rebecca, Verma, Arjun, Venterea, Ricco C., Wang, He, Wang, Jianwu, Wang, Sam, Wang, Shaowen, Watts, Gordon, Weitz, Jason, Wildridge, Andrew, Williams, Rebecca, Wolf, Scott, Xu, Yue, Yan, Jianqi, Yu, Jai, Zhang, Yulei, Zhao, Haoran, Zhao, Ying, Zhong, Yibo
Scientific discoveries are often made by finding a pattern or object that was not predicted by the known rules of science. Oftentimes, these anomalous events or objects that do not conform to the norms are an indication that the rules of science governing the data are incomplete, and something new needs to be present to explain these unexpected outliers. The challenge of finding anomalies can be confounding since it requires codifying a complete knowledge of the known scientific behaviors and then projecting these known behaviors on the data to look for deviations. When utilizing machine learning, this presents a particular challenge since we require that the model not only understands scientific data perfectly but also recognizes when the data is inconsistent and out of the scope of its trained behavior. In this paper, we present three datasets aimed at developing machine learning-based anomaly detection for disparate scientific domains covering astrophysics, genomics, and polar science. We present the different datasets along with a scheme to make machine learning challenges around the three datasets findable, accessible, interoperable, and reusable (FAIR). Furthermore, we present an approach that generalizes to future machine learning challenges, enabling the possibility of large, more compute-intensive challenges that can ultimately lead to scientific discovery.
Triad: Vision Foundation Model for 3D Magnetic Resonance Imaging
Wang, Shansong, Safari, Mojtaba, Li, Qiang, Chang, Chih-Wei, Qiu, Richard LJ, Roper, Justin, Yu, David S., Yang, Xiaofeng
Vision foundation models (VFMs) are pre-trained on extensive image datasets to learn general representations for diverse types of data. These models can subsequently be fine-tuned for specific downstream tasks, significantly boosting performance across a broad range of applications. However, existing vision foundation models that claim to be applicable to various clinical tasks are mostly pre-trained on 3D computed tomography (CT), which benefits from the availability of extensive 3D CT databases. Significant differences between CT and magnetic resonance imaging (MRI) in imaging principles, signal characteristics, and data distribution may hinder their practical performance and versatility in MRI-specific applications. Here, we propose Triad, a vision foundation model for 3D MRI. Triad adopts a widely used autoencoder architecture to learn robust representations from 131,170 3D MRI volumes and uses organ-independent imaging descriptions to constrain the semantic distribution of the visual modality. This pre-training dataset, called Triad-131K, is currently the largest 3D MRI pre-training dataset. We evaluate Triad across three tasks, namely organ/tumor segmentation, organ/cancer classification, and medical image registration, in two settings (within-domain and out-of-domain) using 25 downstream datasets. By initializing models with Triad's pre-trained weights, nnUNet-Triad improves segmentation performance by 2.51% compared to nnUNet-Scratch across 17 datasets. Swin-B-Triad achieves a 3.97% improvement over Swin-B-Scratch in classification tasks across five datasets. SwinUNETR-Triad improves by 4.00% compared to SwinUNETR-Scratch in registration tasks across two datasets. Our study demonstrates that pre-training can improve performance when the data modalities and organs of upstream and downstream tasks are consistent.
Assessing and Prioritizing Ransomware Risk Based on Historical Victim Data
Massengale, Spencer, Huff, Philip
We present an approach to identifying which ransomware adversaries are most likely to target specific entities, thereby assisting these entities in formulating better protection strategies. Ransomware poses a formidable cybersecurity threat characterized by profit-driven motives, a complex underlying economy supporting criminal syndicates, and the overt nature of its attacks. This type of malware has consistently ranked among the most prevalent, with a rapid escalation in activity observed. Recent estimates indicate that approximately two-thirds of organizations experienced ransomware attacks in 2023 \cite{Sophos2023Ransomware}. A central tactic in ransomware campaigns is publicizing attacks to coerce victims into paying ransoms. Our study utilizes public disclosures from ransomware victims to predict the likelihood of an entity being targeted by a specific ransomware variant. We employ a Large Language Model (LLM) architecture that uses a unique chain-of-thought, multi-shot prompt methodology to define adversary SKRAM (Skills, Knowledge, Resources, Authorities, and Motivation) profiles from ransomware bulletins, threat reports, and news items. This analysis is enriched with publicly available victim data and is further enhanced by a heuristic for generating synthetic data that reflects victim profiles. Our work culminates in the development of a machine learning model that assists organizations in prioritizing ransomware threats and formulating defenses based on the tactics, techniques, and procedures (TTP) of the most likely attackers.
Improving Legal Entity Recognition Using a Hybrid Transformer Model and Semantic Filtering Approach
Legal Entity Recognition (LER) involves identifying key entities such as parties, dates, monetary amounts, and legal provisions from legal documents. Automating this process is crucial for improving efficiency in legal workflows, including contract review, compliance monitoring, and litigation support. Traditional Named Entity Recognition (NER) methods, such as rule-based systems and classical machine learning models like Conditional Random Fields (CRFs), require extensive feature engineering and struggle to adapt to new legal terminologies. Transformer-based models, particularly BERT [1], have shown great promise in various NLP tasks, including LER. Legal-BERT, a fine-tuned variant of BERT for legal texts, has demonstrated superior performance
EVOLvE: Evaluating and Optimizing LLMs For Exploration
Nie, Allen, Su, Yi, Chang, Bo, Lee, Jonathan N., Chi, Ed H., Le, Quoc V., Chen, Minmin
Despite their success in many domains, large language models (LLMs) remain under-studied in scenarios requiring optimal decision-making under uncertainty. This is crucial as many real-world applications, ranging from personalized recommendations to healthcare interventions, demand that LLMs not only predict but also actively learn to make optimal decisions through exploration. In this work, we measure LLMs' (in)ability to make optimal decisions in bandits, a stateless reinforcement learning setting relevant to many applications. We develop a comprehensive suite of environments, including both context-free and contextual bandits with varying task difficulties, to benchmark LLMs' performance. Motivated by the existence of optimal exploration algorithms, we propose efficient ways to integrate this algorithmic knowledge into LLMs: by providing explicit algorithm-guided support during inference; and through algorithm distillation via in-context demonstrations and fine-tuning, using synthetic data generated from these algorithms. Impressively, these techniques allow us to achieve superior exploration performance with smaller models, surpassing larger models on various tasks. We conduct an extensive ablation study to shed light on various factors, such as task difficulty and data representation, that influence the efficiency of LLM exploration. Additionally, we conduct a rigorous analysis of the LLM's exploration efficiency using the concept of regret, linking its ability to explore to the model size and underlying algorithm.
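For readers unfamiliar with the bandit setting and the notion of regret, a minimal Python sketch may help. It runs UCB1, a classical exploration algorithm chosen here only as a familiar example (the abstract does not name the algorithms used), on a context-free Bernoulli bandit and accumulates the expected regret. The function name, arm means, and horizon are illustrative assumptions.

```python
import math
import random


def ucb1(means, horizon, seed=0):
    """Run UCB1 on a Bernoulli bandit; return (cumulative regret, pull counts).

    means:   true success probability of each arm (unknown to the learner).
    horizon: number of rounds to play.
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k      # times each arm was pulled
    sums = [0.0] * k      # total reward collected per arm
    best = max(means)
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1   # pull every arm once to initialize the estimates
        else:
            # Empirical mean plus an exploration bonus that shrinks with pulls.
            arm = max(
                range(k),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        # Expected regret: gap between the best arm and the arm played.
        regret += best - means[arm]
    return regret, counts


regret, counts = ucb1([0.2, 0.8], horizon=2000)
```

An agent that explores efficiently keeps the cumulative regret growing slowly (logarithmically for UCB1), whereas an agent that never stops picking arms at random accumulates regret linearly in the horizon.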
Environment Scan of Generative AI Infrastructure for Clinical and Translational Science
Idnay, Betina, Xu, Zihan, Adams, William G., Adibuzzaman, Mohammad, Anderson, Nicholas R., Bahroos, Neil, Bell, Douglas S., Bumgardner, Cody, Campion, Thomas, Castro, Mario, Cimino, James J., Cohen, I. Glenn, Dorr, David, Elkin, Peter L, Fan, Jungwei W., Ferris, Todd, Foran, David J., Hanauer, David, Hogarth, Mike, Huang, Kun, Kalpathy-Cramer, Jayashree, Kandpal, Manoj, Karnik, Niranjan S., Katoch, Avnish, Lai, Albert M., Lambert, Christophe G., Li, Lang, Lindsell, Christopher, Liu, Jinze, Lu, Zhiyong, Luo, Yuan, McGarvey, Peter, Mendonca, Eneida A., Mirhaji, Parsa, Murphy, Shawn, Osborne, John D., Paschalidis, Ioannis C., Harris, Paul A., Prior, Fred, Shaheen, Nicholas J., Shara, Nawar, Sim, Ida, Tachinardi, Umberto, Waitman, Lemuel R., Wright, Rosalind J., Zai, Adrian H., Zheng, Kai, Lee, Sandra Soo-Jin, Malin, Bradley A., Natarajan, Karthik, Price, W. Nicholson II, Zhang, Rui, Zhang, Yiye, Xu, Hua, Bian, Jiang, Weng, Chunhua, Peng, Yifan
This study reports a comprehensive environmental scan of the generative AI (GenAI) infrastructure in the national network for clinical and translational science across 36 institutions supported by the Clinical and Translational Science Award (CTSA) Program led by the National Center for Advancing Translational Sciences (NCATS) of the National Institutes of Health (NIH) in the United States. With the rapid advancement of GenAI technologies, including large language models (LLMs), healthcare institutions face unprecedented opportunities and challenges. This research explores the current status of GenAI integration, focusing on stakeholder roles, governance structures, and ethical considerations by administering a survey among leaders of health institutions (i.e., representing academic medical centers and health systems) to assess the institutional readiness and approach towards GenAI adoption. Key findings indicate a diverse range of institutional strategies, with most organizations in the experimental phase of GenAI deployment. The study highlights significant variations in governance models, with a strong preference for centralized decision-making but notable gaps in workforce training and ethical oversight. Moreover, the results underscore the need for a more coordinated approach to GenAI governance, emphasizing collaboration among senior leaders, clinicians, information technology staff, and researchers. Our analysis also reveals concerns regarding GenAI bias, data security, and stakeholder trust, which must be addressed to ensure the ethical and effective implementation of GenAI technologies. This study offers valuable insights into the challenges and opportunities of GenAI integration in healthcare, providing a roadmap for institutions aiming to leverage GenAI for improved quality of care and operational efficiency.
Abstractive Text Summarization: State of the Art, Challenges, and Improvements
Shakil, Hassan, Farooq, Ahmad, Kalita, Jugal
Focusing specifically on abstractive text summarization, as opposed to extractive techniques, this survey presents a comprehensive overview of state-of-the-art techniques, prevailing challenges, and prospective research directions. We categorize the techniques into traditional sequence-to-sequence models, pre-trained large language models, reinforcement learning, hierarchical methods, and multi-modal summarization. Unlike prior works that did not examine complexities, scalability, and comparisons of techniques in detail, this review takes a comprehensive approach encompassing state-of-the-art methods, challenges, solutions, comparisons, and limitations, and charts out future improvements, providing researchers with an extensive overview to advance abstractive summarization research. We provide comparison tables across the categorized techniques, offering insights into model complexity, scalability, and appropriate applications. The paper highlights challenges such as inadequate meaning representation, factual consistency, controllable text summarization, cross-lingual summarization, and evaluation metrics, among others. Solutions leveraging knowledge incorporation and other innovative strategies are proposed to address these challenges. The paper concludes by highlighting emerging research areas like factual inconsistency, domain-specific, cross-lingual, multilingual, and long-document summarization, as well as handling noisy data. Our objective is to provide researchers and practitioners with a structured overview of the domain, enabling them to better understand the current landscape and identify potential areas for further research and improvement.
MAMA-MIA: A Large-Scale Multi-Center Breast Cancer DCE-MRI Benchmark Dataset with Expert Segmentations
Garrucho, Lidia, Reidel, Claire-Anne, Kushibar, Kaisar, Joshi, Smriti, Osuala, Richard, Tsirikoglou, Apostolia, Bobowicz, Maciej, del Riego, Javier, Catanese, Alessandro, Gwoździewicz, Katarzyna, Cosaka, Maria-Laura, Abo-Elhoda, Pasant M., Tantawy, Sara W., Sakrana, Shorouq S., Shawky-Abdelfatah, Norhan O., Abdo-Salem, Amr Muhammad, Kozana, Androniki, Divjak, Eugen, Ivanac, Gordana, Nikiforaki, Katerina, Klontzas, Michail E., García-Dosdá, Rosa, Gulsun-Akpinar, Meltem, Lafcı, Oğuz, Mann, Ritse, Martín-Isla, Carlos, Prior, Fred, Marias, Kostas, Starmans, Martijn P. A., Strand, Fredrik, Díaz, Oliver, Igual, Laura, Lekadir, Karim
Current research in breast cancer Magnetic Resonance Imaging (MRI), especially with Artificial Intelligence (AI), faces challenges due to the lack of expert segmentations. To address this, we introduce the MAMA-MIA dataset, comprising 1506 multi-center dynamic contrast-enhanced MRI cases with expert segmentations of primary tumors and non-mass enhancement areas. These cases were sourced from four publicly available collections in The Cancer Imaging Archive (TCIA). Initially, we trained a deep learning model to automatically segment the cases, generating preliminary segmentations that significantly reduced expert segmentation time. Sixteen experts, averaging 9 years of experience in breast cancer, then corrected these segmentations, resulting in the final expert segmentations. Additionally, two radiologists conducted a visual inspection of the automatic segmentations to support future quality control studies. Alongside the expert segmentations, we provide 49 harmonized demographic and clinical variables and the pretrained weights of the well-known nnUNet architecture trained using the DCE-MRI full-images and expert segmentations. This dataset aims to accelerate the development and benchmarking of deep learning models and foster innovation in breast cancer diagnostics and treatment planning.